The Annals of Applied Statistics 
2012, Vol. 6, No. 2, 497-520 
DOI: 10.1214/11-AOAS526 

© Institute of Mathematical Statistics. 2012 



A BAYESIAN MODEL AVERAGING APPROACH FOR 
OBSERVATIONAL GENE EXPRESSION STUDIES 

By Xi Kathy Zhou 1 , Fei Liu 2 and Andrew J. Dannenberg 3 

Weill Medical College of Cornell University, IBM Watson Research Center 
and Weill Medical College of Cornell University 

Identifying differentially expressed (DE) genes associated with 
a sample characteristic is the primary objective of many microarray 
studies. As more and more studies are carried out with observational 
rather than well controlled experimental samples, it becomes impor- 
tant to evaluate and properly control the impact of sample hetero- 
geneity on DE gene finding. Typical methods for identifying DE genes 
require ranking all the genes according to a preselected statistic based 
on a single model for two or more group comparisons, with or with- 
out adjustment for other covariates. Such single model approaches 
unavoidably result in model misspecification, which can lead to in- 
creased error due to bias for some genes and reduced efficiency for 
the others. We evaluated the impact of model misspecification from 
such approaches on detecting DE genes and identified parameters 
that affect the magnitude of impact. To properly control for sample 
heterogeneity and to provide a flexible and coherent framework for 
identifying simultaneously DE genes associated with a single or mul- 
tiple sample characteristics and/or their interactions, we proposed 
a Bayesian model averaging approach which corrects the model mis- 
specification by averaging over model space formed by all relevant 
covariates. An empirical approach is suggested for specifying prior 
model probabilities. We demonstrated through simulated microarray 
data that this approach resulted in improved performance in DE gene 
identification compared to the single model approaches. The flexibil- 
ity of this approach is demonstrated through our analysis of data 
from two observational microarray studies. 
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1. Introduction. In recent years, as the rapid advances in biotechnol- 
ogy have markedly driven down the cost of microarray experiments, more 
and more large scale studies are carried out with heterogeneous samples, 
conveniently collected from subjects of different phenotypic characteristics 
and exposure histories. Such microarray studies are considered observational 
rather than experimental in nature [Potter (2003)] because the effects of con- 
founding or correlation in covariates need to be properly handled. The sam- 
ple complexity of such studies presents both opportunities and challenges to 
the analysis. Considering the differential gene expression studies, with multi- 
faceted sample characteristics, one may explore more complex questions that 
are not possible with a more homogeneous sample such as the identification 
of differentially expressed (DE) genes associated with not just one sample 
characteristic but multiple characteristics and/or their interactions. For ex- 
ample, Boyle et al. (2010) investigated DE genes associated with smoking 
as well as smoking x gender interaction. In another study involving smokers 
and never smokers [Carolan et al. (2008)], microarray data were obtained 
for an unbalanced lung airway epithelium sample involving different tissue 
sites from subjects of different gender, age and ethnicity. An interesting 
question is to identify DE genes associated with either a single or multiple 
sample characteristics. To address these questions, one needs to quantify 
the strength of association between the expression of each gene and a set 
of sample characteristics. This differs from the gene set enrichment analysis 
[Efron and Tibshirani (2007); Efron (2010)], where the interest is to quan- 
tify the strength of association between a set of genes and a single sample 
characteristic. Direct application of currently available approaches to these 
questions does not provide a coherent solution and has clear limitations. 

Methods for identifying DE genes are typically based on the ranking of 
statistics for between group differences associated with one sample character- 
istic (also known as a factor or a covariate), such as the t-, ^-statistics, their 
nonparametric counterparts, their modified forms, or the Bayesian versions 
[see Jeffery, Higgins and Culhane (2006) for an excellent review of the various 
approaches]. These methods are suited for well controlled experiments. Their 
lack of control for confounding factors attracts increasing concern when ap- 
plied to observational microarray studies [Potter (2003); Webb et al. (2007); 
Troester, Millikan and Perou (2009)]. With observational samples, the re- 
sults may be confounded by a variety of sample characteristics, such as age, 
sex, genetic profile, exposure and treatment history, etc., which can lead to 
an increased number of false discoveries. Recent studies by Scheid and Spang 
(2007) and Leek and Storey (2007) suggested that hidden traces of unknown 
confounders may exist in DE gene studies and that ranking statistics need to 
be adjusted accordingly. To account for the effects of possible confounders, 
several approaches have been adapted from traditional observational stud- 
ies and applied to microarray data [Smyth (2004); Hummel, Meister and 
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Mansmann (2008)], including adjustment via multiple regression on known 
confounders or on surrogate variables for unknown confounders [Leek and 
Storey (2007)], or via a matched study design [Heller, Manduchi and Small 
(2009)]. 

Regardless of covariate adjustment, the aforementioned approaches rank 
the genes based on the effect sizes estimated using the same model, that is, 
a model with the same structure and same set of covariates, for all genes. 
Such a single model approach can be problematic for high-dimensional mi- 
croarray data because different genes may be involved in different biological 
processes and their expression may be affected by different sets of covariates. 
More specifically, as shown in Section 2, such an approach leads to model 
misspecification for a certain proportion of the genes and does not offer the 
same level of accuracy and efficiency for the effect size estimation for genes 
under investigation. 

To avoid model misspecification in microarray data analysis, an ideal solu- 
tion could be to apply different models to different sets of genes whereby each 
model contains only the set of covariates relevant to the genes it is describing. 
Identifying appropriate models for different sets of genes can be challenging 
because model uncertainty makes it difficult to identify a single best model. 
The Bayesian model averaging (BMA) approach offers an attractive alter- 
native solution to this problem. Hoeting et al. (1999) provides a review of 
this approach in more traditional settings. In recent years, BMA approaches 
have been developed to handle various problems involving high throughput 
genetic data. For example, they were used to improve the assessment of can- 
didate gene effects in the genome-wide association studies [Wu et al. (2010); 
Xu, Craiu and Sun (2011)] and to improve sample classification using gene 
expression microarray data [Yeung, Bumgarner and Raftery (2005)]. They 
have also been shown to improve the DE gene detection in settings where 
the microarray data involved two different distributional assumptions [Se- 
bastiani, Xie and Ramoni (2006)] or were from different sources [Conlon, 
Song and Liu (2006)]. All these approaches are computationally expensive, 
as MCMC simulation is used to obtain estimates of model parameters. In 
this study, we propose a BMA approach for observational microarray studies 
based on linear regression models. It does not require MCMC simulations for 
estimating model parameters and offers a flexible and coherent framework 
to identify simultaneously DE genes associated with a single factor, multiple 
factors and/or their interactions. 

In the next section we discuss limitations of single model approaches. In 
particular, we evaluate the impact of model misspecification from such ap- 
proaches on DE gene finding. We also identify parameters that affect the 
magnitude of impact. In Section 3 we propose to find DE genes with a BMA 
approach that properly controls for sample heterogeneity and model un- 
certainty. In Section 4 we compare the performances of ranking statistics 
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based on a simple model, a complex model and the BMA approach in sim- 
ulated microarray studies. Section 5 concludes with applications of BMA to 
two existing microarray data sets. Our analysis supports the utility of the 
BMA method as a useful tool for capturing and quantifying the complex 
relationship between gene expression patterns and sample characteristics in 
observational microarray studies. 

2. Limitation of the single model approaches. In this section we con- 
sider a general framework to describe gene expression variations in microar- 
rays. Under this framework, we argue that the single model approaches 
to DE gene detection are overly simplified and subjected to the impact 
of model misspecification, for example, the omission of relevant covariates 
when a simple model is used and the inclusion of irrelevant covariates when 
a complex model is used. The consequences of such model misspecification 
have been discussed extensively in the linear regression setting [Rao (1971, 
1973); Rosenberg and Levy (1972)]. The implication of these results, how- 
ever, has not been fully investigated in DE gene studies. In this section we 
evaluate the consequences of model misspecification from the single model 
approaches on performance measures often used in DE gene studies, includ- 
ing the false discovery rate (FDR) and sensitivity. We conclude this section 
with a summary of the main results. 

2.1. Notation. We consider an observational microarray study which 
aims to identify DE genes associated with different values of a factor X\ , for 
example, cigarette smoking exposure. Expression profiles of J genes are ob- 
tained for n subjects with different values of X\ . Without loss of generality, 
a typical model for identifying X\ related DE genes can be written as 

(2.1) yij = Poj + PijXu H h PkjXki + Vij 

or 

(2.2) = a j + a lj x li -\ h a kj x ki + a( k+1 yX( k+1 y + , 

where is the normalized and typically log-transformed expression level 
of gene j in subject i; xn is the factor level for X\ in subject i; X2i, ■ ■ ■ ,x k i 
are levels for other factors, denoted by X2, ■ ■ ■ ,X k , that affect the expres- 
sion of all the genes, for example, experimental parameters involved in the 
microarray experiments; x^+i)i is the level of a potential confounding fac- 
tor Xk+i, for example, gender, age, race, alcohol exposure, etc.; and £ij 
denote normally distributed random errors. 

To identify DE genes related to X\, p- values based on i-statistic of esti- 
mate of either (3±j or ay can be used as the ranking statistics. If model (2.1) 
is used, the relevant t-statistic for gene j is tjvfi,ij = Pij/ 3 d($ij), where j3\j 
is the least square estimate of (3u. If model (2.2) is used, the f-statistic for 
gene j is calculated as tM 2 ,ij = otij / sd(a\j) . It can be shown that the two 
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statistics are related as follows: 

/r, n\ , S'l-23-fe , , 'S'fc+l-l-fc^+l.l^+l^' 

[2.6) t Miili _- tM 2 ,ljH -j— , 

Jl-23-fc+l sd(Pij) 

where b k+ \ and efc + i are the residual sum of squares, least square 

parameter estimates and residual, respectively, from the following auxiliary 
regression equation: 

(2.4) X k+1 =Xb k+1 + e k+1 , 

where X = (X±, . . . ,X k ). Sf, 2 3... k k+i ls * ne residual sum of squares for the 
auxiliary regression with X\ as the outcome and X2, ■ ■ ■ , X k+ \ as the covari- 
ates. 

For an observational microarray study, such single model approach with 
or without covariate adjustment has an intrinsic limitation, that is, neither 
model can be the true model for all the genes. For the aforementioned hy- 
pothetical microarray study, model (2.1) is the true model only for genes 
not related to X k+ i (X k+ i null genes, or M\ genes), and model (2.2) is the 
true model only for genes related to X k+ i (X k+ i DE genes, or M2 genes). 
Based on these considerations, a multi-model approach that uses p-values 
of £mi,i- to rank the M\ genes and p-values of £m 2 ,i- to rank the M2 genes 
is preferable. 

The performance difference between the single model and the multi-model 
approaches can be compared by utilizing the relationship between the two 
t-statistics. Let Fi(t) and -F2W be the density distributions of the ranking 
statistics £mi,i- an d iM 2 ,i-> respectively. Under the multi-model approach, 
the density distribution of the ranking statistics can be written as 

F(t) = (l-f)F 1 (t) + fF 2 (t), 

where / is the proportion of M2 genes. F\{t) and F 2 (t) can further be written 

as 

F 1 (t) = (l- Pl )F 10 (t)+ Pl F 11 (t), 
F 2 (t) = (l-p2)F 20 (t) + P 2F 2 i(t), 

where p\ and P2 are the proportions of DE genes in M\ and M2 genes, F.o(t) 
and F.i(t) are distributions of the test statistic for the null and DE genes, 
respectively. For a given cutoff c> 0, the false discovery rate and sensitivity 
can be calculated as 

(1-/)(1-Pi)[l-F 10 (c)] 



FDR(c) 



(2.5) 



and 



(l-f)[l- Fl (c)] + f[l-F 2 (c)] 
f(l-p 2 )[l-F 20 (c)} 



(l-f)[l- Fl (c)]+f[l-F 2 (c)] 
5(c) = 2(1 - f) Pl [l - F n (c)} + 2f P2 [l - F 21 (c)}. 
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We discuss the impact of the two single model approaches on the FDR and 
sensitivity separately. 

2.2. Single model without covariate adjustment. When model (2.1) is 
used, the FDR can be written as 



FDR Ml (c 



(1-/)(1- Pl )[l-F 10 (c)] 
(l-f)[l- Fl (c)] + f[l-F 2 M ^c)] 

/(l-p2)[l-iffi( C )] 

(l-f)[l- Fl (c)] + f[l-F 2 M ^c)] 

The sensitivity can be written as 

S Ah (c) = 2(1 - f) Pl [l - F u (c)] + 2/p 2 [l - F 2 f(c)}. 

Superscript Mi is used to denote that the distribution of i-statistic is derived 
from model (2.1), which is misspecified for the M2 genes because of omitting 
relevant covariate 

Omission of relevant covariate leads to bias in the model parameter esti- 
mates [Rao (1971)]. Specifically, the bias can be written as 

(2.6) Bias0ij) = E{S^l l , 1 ,„ k b k+1 ^el +l Y j ) = a k+lJ ■ b k+ljl , 

where frfc+i,i.23---fc is the least square estimate of the parameter associated 
with X\ in the auxiliary regression (2.4). Therefore, we have for the M2 
gene j 

bk+i,ia>k+i 



S 



E{tM 2 ,lj) + 



0~2j/Sl-23-k+l. 

It is known that Sl 23 ... k k+l < Sl 23 ... k . 

For the M2 DE genes, because tM 1; ij can be greater or less than tM2,ij 
depending on the values of a\ and Bias0i), F 2 ^{t) is unlikely to be sys- 
tematically different from F2\{t) and results in great changes in sensitivity. 

However, for the M2 null genes, the above results indicate E\tM!,ij\ > 
E\tM 2 ,ij\t that is, the distribution of tM 2 ,ij f° r the M2 null genes moves away 
from zero. Hence, 1 — F 2 ^ (c) > 1 — F20 (c). Let a and b be the denominator 
and numerator of FDR(c) as written in (2.5), respectively. Let 5 be the 
difference between the numerators of FDR 11 (c) and FDR(c), that is, 

6 = f(l- P2 ){[l-F 2 f(c)]-[l-F2o(c)]}, 
and 5' be the difference between the denominators of the two FDRs, 

6' = /(l - P2){[1 - F^(c)] - [1 - F 20 (c)]} 

+ f P2 {[l-F 2 f(c)]-[l-F2 1 (c)]}. 

As discussed above, [1 — F^f x (c)] is comparable to [1 — i*2i(c)] because 
the bias is unlikely to lead to systematic difference between F^f 1 ^) and 
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i*2i(£)- Additionally, p2 generally is much smaller than l—p2 in microarrays. 
Therefore, 8' « 8 and FDR Ml (c) can be approximated by (6 + 8) /(a + 8). 
Since (6 + 8) /(a + 5) > b/a for any a > b > and 5 > 0, this indicates 
FDR AIl (c) > FDR(c), that is, increased FDR with this single model ap- 
proach. 

2.3. Single model with covariate adjustment. When model (2.2) is used, 
the FDR and sensitivity at a given cutoff can be written as 

FDR M Hc) = (WXl-PiHl-ffifrQ] 

(l-/)[l-F^(c)]+/[l-F 2 ( C )] 

f(l- P2 )[l-F 20 (c)] 



+ 



(l-/)[l-i^( C )]+/[l-F 2 (c)] 
and 

5 M2 (c) = 2(1 - f) Pl [l - F^(c)} + 2/paH - F 21 (c)], 

due to the potential change in the distributions of test statistics for the M\ 
genes. The relationship of the two t-statistics can be written as 

. Sl-23---fc+l , . ^fc+l-l---fc^+l,l e fc+l^i 
tM 2 ,lj — —5 tMi.lj H — v • 

It is known that, with the inclusion of an irrelevant covariate, model (2.2) 
does not result in a biased parameter estimate for the M± genes. However, 
since sd{j3\j) < sd{a\j) in general, E{\tM l ,i\) > -^(|*m 2 ,i|) f° r Mi DE genes. 
Therefore, the distribution F^ 2 (t) moves toward and results in S M2 (c) < 
S(c), that is, reduced sensitivity in detecting DE genes in Mi genes. As 
|*Afi,i I in general is likely to be greater than |ijwr 2 ,i | , F^ 2 also shrinks to- 
ward 0. It is likely that FDR M2 (c) will be comparable to FDR Ml (c). Hence, 
reduced sensitivity in detecting DE genes in Mi genes will be the main 
consequence resulted from applying the complex model for all the genes. 

2.4. Summary. The above results suggested that the single model ap- 
proaches with or without covariate adjustment can lead to inferior perfor- 
mance. It is expected that the impact on FDR and sensitivity could be 
greater if more Xfc + i-like covariates exist in the sample. These results will 
be further demonstrated in the simulation study. The above discussion also 
suggested that the performance for DE gene detection can be improved by 
applying the correct model for the right sets of genes. Yet, such knowledge 
is commonly not available beforehand. In the following section, we propose 
a BMA approach as a practical substitute for the multi-model approach for 
DE gene detection that takes into account both sample heterogeneity and 
model uncertainty. 
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3. A Bayesian model averaging approach. In this section we discuss an 
efficient Bayesian model averaging approach to identifying DE genes associ- 
ated with a covariate of interest. The methodology proposed in this paper 
is closely related to methods discussed in Liang et al. (2008) and we largely 
follow their notation. Consider a series of possible models for describing the 
expression pattern of each gene. Let 7 = (71,..., 7^) be a binary vector 
of length K, with each element indicating the inclusion status of the kth 
covariate in the model, that is, 

' 0, if fa = 0, 
1, iffl^O. 

Each model in the model space can then be labeled by 7, namely, .A/f 7 . For 
gene j, j = 1, . . . , J, the model can be written as 

M 7j : Yj = a 7j l n + X 7 /3 7j + N(0, 4>~]l n ), 

where a 7 j is the intercept term; X 7 is the submatrix of X consisting of 
columns associated with nonzero jk Pjj and 4>*yj are parameters under this 
model. 

The marginal posterior inclusion probability for variable Xk and gene j, 
is then defined as 

(3.1) P kj = P( lkj + 0|Y,) = h kj =i x PiM^Yj), 

1 

which is the sum of posterior probabilities of all models that include the 
covariate of interest. It quantifies the strength of association between co- 
variate and the expression level of the jth gene and can be used to rank 
the DE genes. 

The posterior model probability for A4jj can be calculated based on Bayes 
factors of pairs of models, for example, 

(3 2) P(M \Y)~ P(M 13 )BF{M 13 :Mo 3 ) 

where p(A4jj) is the prior model probability for genes measured in the 
microarray experiment and the Bayes factor BF(M-yj :A4oj) is defined as 



BF{M lj :M j) 



/(Yil-Mo;)' 

that is, the ratio of marginal likelihood under A4-yj and the base model, Moj- 
Here the null model (i.e., the model with only the intercept term) is used as 
the base model. For Mjj, the marginal likelihood is obtained by integrating 
out the model parameters from the joint posterior probability 

f(Y j \M jj ) = j /(Y J |0 7i )vr(0 7J |A^ 7J )d0 7J , 

where 7 j = (a 7J - , /3 7J - , </> 7 j ) , and 7r(0 7 j |-M 7 j) is the prior of model param- 
eters under M.-yj. 
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To determine the Bayes factor, proper priors, 7r(0 7J -|.M 7 j), are needed. 
We utilized the Zellner-Siow prior for model parameters [Zellner and Siow 
(1980)] in our study. Liang et al. (2008) have shown that this prior re- 
solves several consistency issues associated with fixed g-priors while retaining 
several attractive properties such as adaptivity, good shrinkage properties, 
robustness to the misspecification of g and fast marginal likelihood calcu- 
lation. When comparing two nested models as in our case, a flat prior is 
placed on common coefficients, (a 7 j, <^ 7 j), where 7r(a 7 j, </> 7? -|.M 77 -) oc l/0 7 j, 
and a Cauchy prior on the remaining parameters, /3 7J -. The multivariate 
Cauchy prior can then be represented as a mixture of g-priors with an Inv- 
gamma(l/2, n/2) prior on g, that is, 



The Bayes factor in equation (3.2) can be written in closed form as 



the ordinary coefficient of determination of this model. This quantity can 
be obtained through direct numerical integration or through the Laplace 
approximation. 

In addition to the prior 7r(0 7? -|.M 7 j) on model parameters, one must also 
choose a prior on the models themselves, which relates directly to multi- 
plicity. Scott and Berger (2010) discussed several prior model probability 
choices regarding their effects on multiplicity-control for multiple models 
in a conventional Bayesian model selection/averaging setting involving one 
outcome variable. With the high throughput data, typically, the prior model 
probabilities should reflect our prior belief about the distribution of the 
models among the genes in the transcriptome, which can be difficult to 
quantify. A uniform prior assumed equal probabilities of the models can 
be problematic when thousands of genes are evaluated simultaneously be- 
cause it puts an unrealistically low weight to the null model. When the 
resulting posterior model probabilities are used to estimate the posterior 
expected FDR (peFDR) [Newton et al. (2004)], great underestimation can 
occur [Sartor et al. (2006); Cao et al. (2009)]. Correctly estimating FDR un- 
der the Bayesian framework remains an active research field [Efron (2008)]. 
Recent discussions and attempts have largely been focused on statistics de- 
rived from single model approaches [Miiller, Parmigiani and Rice (2007); 
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Cao and Zhang (2010)]. In our case, proper control for multiplicity derived 
from multiple genes and multiple models becomes even more challenging. 

We believe that the prior should lead to a reasonably well calibrated pos- 
terior model probability that measures the model's ability for describing the 
data. We propose an empirical approach to obtain estimates for the prior 
model probabilities, p(A4jj), under the assumption that the prior probabili- 
ties of a given model are the same across genes, that is, p(Ai-yj) = p(M~f). We 
first estimate the proportion of DE genes described by a nonnull model 7, 
uj-y, using Bayes factors. Since BF(A4~ i : M.q) > c, c > 1 suggests evidence 
against the null model [Kass and Raftery (1995)], we can estimate w 7 as 
follows: 



where BFj is a vector of null-based Bayes factors for gene j. Therefore, w 7 
represents the proportion of genes for which model 7 is the best model in 
terms of Bayes factors. Given that Bayes factors based on the Zellner-Siow 
prior are consistent for model selection whether or not the true model is null 
[Liang et al. (2008)], this estimator is a consistent estimator of the proportion 
of genes expressing in a pattern specified by the model. In our simulation 
studies, we found that fixing c at 1 resulted in w 7 being close to the truth 
in most settings. Second, we argue that if the prior model probabilities, 
p(M-y), result in the equality between the overall peFDR under A4-y and 
1 — ujy, reasonable calibration of the posterior model probabilities can be 
achieved. This suggests the following relationship between p{M.- i ) and iOy, 



Hence, p{M~ i ) can be obtained by iteratively updating the following equa- 
tion: 



In our experience, 30 iterations were adequate to result in convergence. At 
present stage, theoretical justification for this prior choice for multiplicity 
control is still lacking. We resort to the simulation study to show that this 
prior choice led to improved performance in both the ranking of the genes 
and in direct FDR estimation compared with the uniform prior. 

4. Simulation study. Simulation studies were designed to compare the 
single model approaches with and without covariate adjustment and the 
"gold standard" multiple-model approach with the correct covariate adjust- 




that is 




{V] _ Ej ^jBFjM-tj : Mo j )=max(BF j )] ' l[BF(M^j : M 0j )>c] 
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ments, as well as the performance of BMA over single-model approaches 
when a multi-model approach is appropriate. Bias and efficiency in each ap- 
proach and sensitivity to the choice of prior on the set of models also will 
be discussed. 

4.1. Simulation of microarray data. The microarray data were simulated 
to mimic an observational study for identifying genes associated with a bi- 
nary variable, for example, the smoking status (s), in a sample with two 
potential confounders, gender (g) and heavy alcohol drinking (d) which are 
also binary. A detailed data generation scheme for the subject characteris- 
tics is provided in Zhou, Liu and Dannenberg (2012). Marginally, half of the 
subjects are assumed to be females, smokers or heavy drinkers. We also as- 
sume complex correlation among these covariates. First, s is correlated with 
both g and d. Specifically, in smokers, 75% are males and 80% are heavy 
drinkers; while in nonsmokers, 25% are males and 20% are heavy drinkers. 
Second, g is also correlated with d. Specifically, 75% of male subjects are 
heavy drinkers, while 25% of females are heavy drinkers. Proportions of sub- 
jects in groups defined by categories of the covariates are provided in Zhou, 
Liu and Dannenberg (2012). Each microarray data set consists of the ex- 
pression of 10,000 genes from n subjects. Gene expression for each subject 
was simulated based on the following model: 

Vij = PijSi + fcjgi + fcjdi + Eij, 

where /?.,- takes either or nonzero values generated from normal distri- 
butions with variances generated following procedures similar to that de- 
scribed by Smyth (2004). Detailed procedures for generating the simulated 
microarray data are provided in Zhou, Liu and Dannenberg (2012). Each 
simulation setting was characterized by values of the following parameters: 
f s , f g and fd, the proportion of genes affected by smoking (s), gender (g) or 
heavy drinking (d), respectively, and n, the sample size. Both moderate and 
relatively large sample sizes were considered, n = 40 and n = 80. For each 
setting, we simulated 10 microarray data sets. The reported results were 
averaged over the results obtained for each data set. 

4.2. Performance of the single model approaches. In this section we com- 
pare the performances of three single model approaches that differed by 
covariate adjustment, that is, without covariate adjustment (SMi), with 
adjustment to g and d (SM2), and with adjustment to surrogate variables 
of g and d (SVA) [Leek and Storey (2007)], and that of the gold standard 
multi-model approach (MM) where the DE genes were fit with their respec- 
tive true models, that is, the adjustment for g and/or d is applied only to 
genes truly affected by g and/or d. The sensitivity and FDR corresponding 
to the ranking statistic, p-value of s, were obtained for each method. To 
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Table 1 

False discovery rate (FDR) and sensitivity (S), in %, among the top smoking related 
genes identified with a p-value cutoff of 0.001 using ranking statistics based on the single 
model approach without covariate adjustment (SMi), the single model approach with 
covariate adjustment (SM2), the surrogate variable analysis approach (SVA) and the 
"gold standard" multi-model approach (MM). FDR and sensitivity arising from gOdO 
genes (i.e., genes not associated with d and g) were included. Microarray data sets were 
simulated based on various settings defined by proportion of genes associated with each 
covariate: f a , f g , fd, and the sample size n 
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show the interplay of bias and efficiency on these performance measures, we 
also quantified the contribution to these measures from genes not associated 
with g and d, denoted as gOdO genes. 

Table 1 shows the performance difference between the single and multi- 
model approaches among top ranked genes identified with a p-value cutoff of 
0.001. We can see that, as discussed in Section 2, SM% led to large increase 
in total FDR compared to MM. The magnitude of difference increased with 
the sample size, the proportion of the genes associated with the confounder 
and the number of the confounders. On the other hand, the difference in 
FDR contributed from the gOdO genes remained small. Hence, the results 
suggested that bias in effect estimation among genes associated with the con- 
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Table 2 

Power of different methods for identifying genes differentially expressed between smokers 
and nonsmokers at 5% FDR under different simulation settings 
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founders was the main cause for the FDR increase. SM2 and SVA showed 
slightly greater FDR compared to MM. This increase came mainly from 
gOdO genes and suggested that the effects of the efficiency loss could have 
a negative impact on the total FDR, particularly in small sample size set- 
tings. A more notable limitation of SM2 and SVA was the loss of sensitivity. 
Compared to MM , the magnitude of sensitivity loss increased slightly with 
sample size and the number of confounders. 

4.3. Performance of the BMA approach. In this section we examine the 
performance of the proposed BMA approach in comparison with the single 
model and the gold standard multi-model approaches. To evaluate the effects 
of prior choice on the performance of the BMA approach, we considered 
three prior model probability choices: the proposed empirical prior obtained 
using the two step approach (BMA\), the uniform prior (BMA2), and the 
true proportion of genes for each model (BMA3). The posterior inclusion 
probability of s was used as the ranking statistics. The number of genes 
identified by each methods at 5% FDR were compared in Table 2. We can 
see that the BMA approaches had greater power in detecting DE genes 
compared to the SM approaches in general and the performance came close 
to that of the MM approach. In fact, in all the simulated settings, the BMA 
approaches, particularly BMA\ and BMA3, showed sensitivity close to the 
MM approach for a given FDR threshold and greater than the single model 
approaches. Figure 1 showed the magnitude of performance difference in 
two representative settings. The BMA approaches appeared to be relatively 
insensitive to the choice of prior model probabilities for gene ranking. 

Besides providing proper ranking of the gene, it is often useful to esti- 
mate the FDR of the finding and quantifying the proportion of DE genes in 
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the transcriptome. Therefore, we also evaluated how well the FDR could be 
estimated based on the ranking statistics. For the p-value based approach, 
FDR and the proportion of DE genes were estimated using the approach by 
Storey (2002) and Storey and Tibshirani (2003). For the Bayesian model av- 
eraging approach, the peFDR was directly estimated based on the posterior 
inclusion probability [Newton et al. (2004)], that is, 

P eFDR k (p) = 5^(1 - P kj ) • l[fjy<p]/J] l[f)y<p], 
3 3 

where < p < 1 and Pkj is the posterior inclusion probability of variable k 
for gene j. Figure 2 shows the estimated FDR vs. the true FDR in two repre- 
sentative settings. We can see that using p-values from SM\ in studies with 
confounder associated genes, the estimated FDR was smaller than the true 
FDR. The magnitude of underestimation increased with the sample size and 
the proportion of the confounder associated genes. On the other hand, the 
FDR estimated using p- values from SM2, SVA or MM was very close to the 
true FDR. The accuracy of the peFDR, as observed by other researchers, ap- 
peared to be sensitive to the prior choice. peFDR obtained based on BMA2, 
the Bayesian model averaging approach with uniform prior can greatly un- 
derestimate the FDR. peFDR obtained based on BMA\ showed improved 
accuracy in FDR estimation. The results from our simulation also suggest 
that the peFDR based on BMA\ are close to true FDR in all simulated set- 
tings. BMA% appeared to result in peFDR that slightly overestimated the 
FDR. Level of sensitivity of the BMA\ approach to the choice of c and model 
space misspecification can be found in Zhou, Liu and Dannenberg (2012). 
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We also carried out two additional sensitivity analyses related to the BMA 
approaches using the simulated microarray data [see Zhou, Liu and Dannen- 
berg (2012) for detail]. First, we investigated the sensitivity of the perfor- 
mance of the empirical BMA approach to the choice of the cutoff c. The 
results suggest that the BMA approach with empirical prior is relatively 
robust in gene ranking with respect to the value of c. Second, we investi- 
gated the performance of the BMA approach to the misspecification of model 
space, that is, omission of an important covariate d. As expected, there is 
a decrease in ranking performance, but the BMA approach still outperforms 
all the single model approaches. It is possible to avoid the performance 
loss due to omission of important covariates by introducing the surrogate 
variables [Leek and Storey (2007)] into the models. However, including the 
surrogate variables in the BMA approach is not a trivial extension due to 
model uncertainty, and it is definitely an interesting future research topic. 

5. Application to the observational micorarray data sets. We applied 
the BMA approach to two smoking related observational microarray studies. 
Through the application, we intended to demonstrate the complex relation- 
ship between the gene expression pattern and sample characteristics and the 
flexibility of the BMA approach in capturing and quantifying such relation 
in a unified and coherent framework. 

5.1. Microarray study of airway epithelium samples. The first data set 
(GSE10006) came from a study with a total of 87 current and never smokers 
[Carolan et al. (2008)]. The microarray analyses were carried out on airway 
epithelium samples from these subjects. The data were preprocessed with 
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the Affymetrix MAS method. After excluding gene probe sets whose ex- 
pression measurements were deemed absent or marginal among all subjects, 
the remaining data consisted of expression profiles of 44,085 probe sets of 
genes from the Affymetrix HGU133plus2 chip for each subject. Among these 
probe sets, 34,614 were annotated for probing the expression of 17,690 genes. 
About half of these genes were probed by multiple probes. To eliminate the 
potential dependence issue, average expression measurements were obtained 
for genes with multiple probe sets. We analyzed the expression data of the 
17,690 genes from 60 healthy subjects. Individuals with known lung disease 
were excluded. Besides smoking status, information on age, gender, race 
and site of the tissue was available. The samples were heavily unbalanced, 
the proportion of smokers was greater in female participants than in males 
(86% vs. 57%), the proportion of large airway samples was slightly larger 
in females than in males (57% vs. 46%), and the proportion of Caucasian 
participants was larger in females compared to males (43% vs. 37%). 

With five covariates, a total of 2 5 models were included in the model 
space. Interaction terms were ignored. The BMA approach allowed for si- 
multaneous assessment of the association between the gene expression and 
each of the sample characteristics, and straightforward estimation of both 
the total proportion of the DE genes in the transcriptome and the propor- 
tion of DE genes associated with each covariate based on Bayes factors. The 
application showed a complex picture of the expression pattern in the ep- 
ithelium microarray study. A total of 69% of the genes were estimated to be 
differentially expressed. The estimated proportions of DE genes for associa- 
tion with smoking, site, gender, race and age were 19%, 34%, 6%, 6% and 
4%, respectively. By controlling the peFDR at 5%, we identified a number of 
DE genes associated with smoking (928), site (3089), gender (73), race (33) 
and age (7). The complex expression patterns were illustrated in Figure 3 
where we show the expression pattern of the top 20 genes associated with 
smoking, gender, site and race, respectively. 

The results also revealed complex roles of some of these DE genes which 
were strongly associated with multiple sample characteristics. For example, 
among the top 928 smoking related DE genes, 343, 18 and 6 of them were also 
identified as hits for association with tissue site, gender and race, respec- 
tively. Additionally, there were 25 genes identified as hits for association with 
three or more sample characteristics, mostly smoking, site and gender. The 
BMA approach allows for assessing jointly genes' association with multiple 
sample characteristics. For example, the joint posterior inclusion probability 
of smoking, site and gender can be obtained by summing over the poste- 
rior probabilities of models containing all three covariates. peFDR can then 
be derived similarly using this posterior inclusion probability. The analy- 
sis identified 6 genes, NRARP, TMEM178, UGT1A®, UGT1A1, UGT1A3 
and UGT1A6, as hits for joint association with the three characteristics at 
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Fig. 3. Gene expression intensities for the top 20 genes associated with each of the 
four covariates (smoking, gender, site and race) identified by using BMA±. Labels along 
the x-axis show the characteristics of a sample subgroup. From top to bottom, the label 
represents categories of race (Others vs. White; O vs. W), site (Large airway vs. Small 
airway; L vs. S), gender (Male vs. Female; M vs. F) and smoking status (Non-Smoker 
vs. Smoker; NS vs. S). For example "OLMNS" indicates the subgroup with the following 
characteristics: Other races (i.e., nonwhite), Large airway sample, Male, Non-Smoker. 



5% peFDR. The existence of such genes suggested a connection between to- 
bacco smoking and the functions of these genes which were partly revealed 
through their association with the phenotype of the subjects from whom 
samples were obtained. Results from such analysis offer additional impor- 
tant information that is useful for generating new hypotheses and insights 
into the effects of tobacco smoke on the transcriptome. 

As discussed in the previous sections, given the existence of genes associ- 
ated with various sample characteristics, single model approaches were sub- 
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jected to the effects of increased bias or reduced power in unbalanced study 
design. For the epithelium microarray data, we saw large differences in gene 
rankings derived from the BMA approach and the three single model ap- 
proaches. Among the top 1000 smoking related DE genes identified by each 
method, the agreement was merely 19.7% among all four methods. Specifi- 
cally, the SVA approach produced gene lists that were vastly different from 
the gene lists produced by the other approaches, where more than half of 
the top 1000 genes had ranks beyond 1000 by the other three methods [see 
the Venn diagram in Zhou, Liu and Dannenberg (2012)]. Careful exami- 
nation of the gene lists produced by the SVA approach suggested possible 
effects of overfitting as the SVA approach adjusted for a total of 10 surrogate 
variables for each gene. The agreement was about 56% for the SM±, SM2 
and BMAi approaches, that is, 56% were ranked within the top 1000 by all 
three methods. The agreement between BMA\ and each of the single model 
approaches (SMi, SM 2 and SVA) was 85%, 64% and 35%, respectively. 
These differences were driven by the genes whose expression patterns were 
not properly captured by the single model. The higher agreement between 
results from BMA\ and SM\ reflects the fact that a large proportion of the 
smoking related DE genes are associated with smoking only. 

5.2. Microarray study of oral mucosa samples. The second data set came 
from a study with a total of 79 age and gender matched healthy smokers and 
never smokers [Boyle et al. (2010)]. The microarray analyses were carried 
out on oral mucosa samples obtained from these subjects through buccal 
biopsies. The preprocessed microarray data consisted of 24,103 probe sets of 
genes from the Affymetrix HGU133plus2 chip for each subject. Among these 
probe sets, 22,004 were annotated for probing the expression of 12,911 genes. 
About 43% of these genes were probed by multiple probe sets. To eliminate 
the potential dependence issue, average expression measurements again were 
obtained for these genes. The analysis was carried for the expression data of 
the 12,911 genes. For subjects recruited for this study, information regarding 
age, gender and smoking status was available. 

The study samples were balanced in terms of gender between smokers and 
nonsmokers. Therefore, single model approaches with or without adjustment 
for gender would provide similar results. However, one interesting biological 
question was whether there were genes affected by smoking differently be- 
tween the males and females. In this context, direct application of the single 
model approach could lead to confusing results. For example, at 5% esti- 
mated FDR, the single model without adjustment for the interaction term 
resulted in 944 hits for association with smoking, while the model adjusted 
for both gender and gender x smoking interaction led to the identification of 
only 1 gene as hits for association with smoking and no genes were identified 
as hits for smoking x gender interaction. Such large difference in DE gene 
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assessment between different models is difficult to reconcile and interpret 
under the single model framework. Yet, such difference can be expected if 
there are genes associated with the interaction because the two variables, 
smoking and smoking x gender interaction, are correlated. Joint testing of 
the effects of smoking and smoking x gender interaction led to the identifi- 
cation of 311 DE genes with the likelihood ratio test. However, this method 
can not quantify the relative contribution from the two variables. We there- 
fore applied the BMA approach to these data to illustrate the flexibility and 
usefulness of this approach to handle possible interaction effects. 

In this application, the model space consists of a total of 16 models includ- 
ing the null model, three models with smoking and/or gender as main effects 
only and 12 models for different patterns that could arise from interaction 
between smoking and gender. For the oral mucosa data, our analysis esti- 
mated that about 22.5% of the genes are differentially expressed, in which 
about 12.3%, 1.5% and 8.6% were associated with smoking, gender and 
smoking x gender interaction, respectively. Controlling the peFDR at 5%, 
our approach identified a total of 414 genes as hits associated with smoking 
through either the main effect, the interaction effect or both. Specifically, 
222 of these genes were associated with smoking primarily through the main 
effect, 2 were associated with smoking primarily through the interaction ef- 
fect, while for the rest of these genes various degrees of association were 
contributed from the interaction term. 

By comparing the smoking related DE genes identified by the single model 
approaches and the BMA approach, we noted that the difference was mainly 
from genes that were over /under expressed in only one subgroup of the 
subjects, female smokers. Neither the model with smoking status as the 
only covariate nor the full model adjusted for both the gender and the 
smoking x gender interaction were able to adequately capture the strength 
of association for this group of genes and properly rank them due to either 
increased bias or decreased power. Table 3 showed the posterior inclusion 
probabilities and ranks based on different approaches for a few of these 
genes. A large difference in the rankings by different methods can be seen. 

6. Discussion. In the past decade, microarray technology has greatly in- 
creased our ability to simultaneously interrogate the expression of tens of 
thousands of genes. Use of this technology has contributed to an improved 
understanding of the molecular basis of various diseases. As one of the pri- 
mary tools for such studies, methods for finding DE genes have also been 
refined over time. Various approaches have been proposed to deal with mul- 
tiple issues in microarray data. Yet, from the modeling perspective, many 
approaches have ignored sample heterogeneity, its impact on the analysis 
results, and the great opportunity it presents. Since Potter (2003) discussed 
the need for controlling bias and confounding in observational microarray 
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studies, it has been increasingly recognized that the lack of control for sample 
heterogeneity could be a barrier to the reproducibility of the study findings. 
In two editorials [Webb et al. (2007); Troester, Millikan and Perou (2009)], 
improved data analysis methods and better study design have been consid- 
ered crucial for advancing the field of cancer epidemiology with microarray 
technology. In particular, Troester, Millikan and Perou (2009) discussed the 
potential of model selection strategies in the process. Nevertheless, there 
remain obstacles to fully appreciate the effect of complex sample charac- 
teristics on DE gene detection and the value of improving upon current 
approaches. 

In this paper, we proposed a novel concept for high throughput data 
analysis involving a heterogeneous sample, that is, a multi-model handling 
is intrinsically needed. We presented the theoretical framework that explains 
why basing inferences on a single model could be problematic in observa- 
tional microarray studies. The problem arises from the inadequacy of using 
a single model to describe the complex expression pattern of genes among 
a heterogeneous sample, which can result in increased number of false dis- 
coveries due to bias when a simple model is used or increased random error 
due to reduced efficiency when a complex model is used. Such effects of 
model misspecification are hard to avoid because of the existence of genes 
being affected by different sets of sample characteristics and/or their interac- 
tions. We showed through simulation that the single model approaches have 
inferior performance in DE gene finding in comparison with a multi-model 
approach should we know the right model for the right set of genes. The 
magnitude of effects on false discovery depends on the study design, specific 
biological system and the mechanism underlying expression variation. 

We proposed to use the BMA approach to improve our ability to identify 
DE genes. This approach utilizes the Zellner-Siow prior for model param- 
eters. The consistency property of this prior is important, as it allows for 
obtaining a consistent estimate of the distribution of the genes in the model 
space using Bayes factors. Another choice could be the hyper-g/ra prior pro- 
posed in Liang et al. (2008). We proposed to use an iterative procedure 
to obtain the prior model probabilities so that the estimated distribution 
of the genes among the model space based on posterior model probabili- 
ties matches the estimate based on the Bayes factors. These prior choices 
allow the efficient computation of the Bayes factors and the posterior inclu- 
sion probabilities that does not rely on a MCMC simulation. Our simulation 
study demonstrated that this approach performed almost as well as the gold 
standard multi-model approach with true models and better than the single 
model approaches in gene ranking. The ranking performance was relatively 
insensitive to a wide range of choice for prior model probabilities. However, 
accuracy of the FDR directly estimated from the posterior model/inclusion 
probabilities was sensitive to the prior choice. Our simulation study showed 
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that the proposed empirical prior model probability allowed for reasonably 
good calibration of posterior model/inclusion probabilities for multiplicity 
and the estimated FDR was close to the true FDR in settings with moderate 
to large sample size. In the rare case of a small study with a heterogeneous 
sample, care needs to be taken when using the empirical prior because the 
small sample size property of the Zellner-Siow prior is less certain. Never- 
theless, it should be pointed out that multiplicity control in the Bayesian 
modeling framework remains a challenging and active research area. Further 
studies on the theoretical aspects of the prior choice for multiplicity control 
across the multiple genes and multiple models are needed. The current BMA 
approach is developed under the M-complete assumption, that is, the model 
space contains the true model. Should unknown confounders exist, it is pos- 
sible to capture the latent confounding factors by introducing the surrogate 
variables [Leek and Storey (2007)]. We note, however, it would be unwise to 
directly incorporate the surrogate variables, currently constructed based on 
residuals derived from a single model fit of the data, into the proposed BMA 
approaches. Our work relies on the assumption of linear regression models 
with normal errors, which may be violated in practice. This calls for new 
approaches that are robust to the normality assumption, which is likely to 
be particularly useful for studies with small sample sizes. For the analysis of 
conventional data with one outcome variable, robust Bayesian model selec- 
tion/averaging approaches have been suggested, for example, the approach 
by Gottardo and Raftery (2009). Extending such ideas to the observational 
microarray studies represents an interesting future direction. 

Finally, through the application of the BMA approach to an observational 
mircoarray study with unbalanced study design and one with balanced study 
design, we showed that complex expression patterns did exist when study 
samples were heterogeneous. Previous research has demonstrated the com- 
plexities of underlying biological mechanisms for gene expression variation. 
Genes affected by several common factors, such as age [Tan et al. (2008)], 
gender [Delongchamp et al. (2005); Yang et al. (2006); Tan et al. (2008)], 
smoking [Spira et al. (2004)] and drinking alcohol [Lewohl et al. (2001)], have 
been found in different tissue samples. Our study showed that such complex- 
ity interfered with the DE gene detection. Notably, the BMA approach was 
able to avoid missing important genes whose expression patterns were not 
adequately captured by a single model approach. As an added value, the 
BMA approach is found to be a flexible tool that allows for more compre- 
hensive characterization of the association between gene expression and the 
characteristics of the subjects from whom the samples were obtained. All 
these can be done within a unified and coherent framework. 
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SUPPLEMENTARY MATERIAL 

Supplement to "A Bayesian model averaging approach for observational 
gene expression studies" (DOI: 10.1214/11-AOAS526SUPP; .pdf). Detailed 
description of the simulation setup and simulation procedure and additional 
results from the simulation study and application to the airway epithelium 
microarray study are provided. 
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